Abstract


The increasing expansion of digital data collected from many sources renders traditional storage,
processing, and analysis methods obsolete. Because of these restrictions, new technologies for
processing and storing very massive datasets have been developed. Big data processing is required to
extract relevant information from such data. Processing implies transforming data into information and
knowledge: big data processing is the process of dealing with massive amounts of data and changing it
from its raw form into usable information in a more understandable manner. As a result, numerous big
data processing execution frameworks have emerged, but determining and selecting the appropriate
framework for processing big data applications is a significant challenge. Therefore, this paper
investigates the possible influence of big data challenges and discusses in depth the most well-known
approaches to big data processing, which are divided into five classes: batch processing, streaming
processing, real-time processing, interactive processing, and hybrid processing, as well as the most
popular frameworks associated with them, such as Apache Hadoop, Dryad, Samza, IBM Infosphere, Storm,
Amazon Kinesis, Drill, Impala, Flink, and Spark. Furthermore, this study presents a comparison of
several features of the frameworks, highlighting their drawbacks and strengths. Thus, it can be used as
a guideline for picking the best application framework in IT analytics and will help business users
make faster decisions.


Article Info

Received: 19 February 2022
Accepted: 22 April 2022
Published: 05 May 2022
doi: 10.51483/IJDSBDA.2.1.2022.1-9
1. Introduction


Data is assisting us in expanding into new areas, better serving current customers, streamlining processes,
and generating raw and analyzed data. Organizations may now utilize structured, unstructured, and
semi-structured data due to technological improvements. Structured data is tabular data found in
relational databases or spreadsheets, and it accounts for only 5% of all available data (Gandomi and
Haider, 2015). Text, images, music, video, social media, and e-commerce content are examples of
unstructured data. Semi-structured data formats, such as e-mail, do not comply with rigorous standards;
Extensible Markup Language (XML), a textual language for transmitting data on the World Wide Web, is often
used to describe semi-structured data (Manjula and Prema, 2020). Data has been created at an astounding
rate from millions of data sources throughout the years. The production of massive amounts of data is
possibly the most significant outcome of the digital revolution (Khalid and Yousaf, 2021). As indicated in
Figure 1, the International Data Corporation predicts in its white paper that digital data will rise to
approximately 80 trillion gigabytes (80 zettabytes) this year (2022) and will reach 175 zettabytes by 2025
(Reinsel et al., 2018).


Figure 1: Annual Size of the Global Datasphere (Reinsel et al., 2018)
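
The growth implied by the two forecast figures quoted above can be checked with a few lines of
arithmetic. The sketch below assumes that 80 ZB is the total size of the datasphere in 2022 and that
growth is compounded yearly at a constant rate; the function name `compound_annual_growth_rate` is
illustrative and does not come from the cited white paper.

```python
# Figures quoted in the text from the IDC forecast, in zettabytes.
SIZE_2022_ZB = 80.0
SIZE_2025_ZB = 175.0
YEARS = 2025 - 2022


def compound_annual_growth_rate(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1.0 / years) - 1.0


# Assumption: growth is compounded once per year at a constant rate.
rate = compound_annual_growth_rate(SIZE_2022_ZB, SIZE_2025_ZB, YEARS)
print(f"Implied annual growth: {rate:.1%}")
```

Under these assumptions the datasphere would have to grow by roughly 30% per year to move from 80 ZB
to 175 ZB in three years.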


Because of the growth of new services such as the Internet of Things, cloud computing, and location-based
applications, the era of big data has arrived (Zheng et al., 2015). Big data is described as a vast volume
of data that necessitates the development of new technologies and architectures in order to benefit from
it through recording and analysis. Big data is vital because the more data we gather, the more accurate
our results and the greater our ability to optimize business processes. Big data is essential for both
businesses and society. Businesses are mostly interested in handling unstructured data. Big data has three
primary features known as the 3Vs (Volume, Variety, and Velocity). Other companies and big data
specialists (engineers, academics, etc.) have expanded these 3Vs to 5Vs by incorporating Value and
Veracity. Volume refers to vast volumes of any type of data from any source. Variety refers to the many
sorts of data acquired by sensors, social networks, and cellphones, such as photographs, data logs, text,
videos, and audio; these data may be structured or unstructured. Velocity refers to the speed of data
transmission. Value refers to the process of extracting useful information from enormous quantities of
social data. Veracity relates to the completeness and accuracy of information. Given these
characteristics, our present data analysis software technologies are unable to collect, organize, and
analyze such a huge volume of data (Al-Barznji and Atanassov, 2016).


Since data has become such an important resource, there has been a lot of discussion on how to effectively
manage and exploit big data. How to analyze massive volumes of real-time data has become a major research
and application problem. This study focuses on big data challenges and the different forms of big data
processing, emphasizing the differences between the processing approaches and the available frameworks.
The remainder of this paper is structured as follows: Section two discusses big data challenges. The third
section presents in depth the main approaches to big data processing and their associated frameworks. The
fourth section (Discussion) presents a collection of common features and compares the frameworks across
these features. Finally, the conclusion section summarizes the study and discusses future directions.


2. Challenges in Big Data 


The issues of big data analytics are divided into five major groups: heterogeneity and incompleteness,
storage and analysis of data, computational complexity, data scalability and visualization, and security
and privacy. These challenges are briefly discussed in the subsections that follow.


2.1. Heterogeneity and Incompleteness


Sophisticated, heterogeneous mixed data follows substantially different rules and patterns. Such data may
be structured or unstructured. Organizations create 80% of their data in an unstructured format. It can
take the shape of graphics, email attachments, documents, health data, images, audio, video, etc., and may
not be stored as structured data in row/column format. This high degree of heterogeneity is a significant
challenge for future big data research (Al-Barznji and Atanassov, 2016; and Cuzzocrea and Loria, 2021).
Incomplete data gives rise to uncertainties during data analysis, which must be addressed. Incomplete data
refers to missing data field values for certain samples. Missing values can be caused by many factors,
including sensor node failure. To address these issues, numerous imputation methods are available
(Al-Barznji and Atanassov, 2016).
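
As an illustration of what an imputation method does, the sketch below fills missing sensor readings
with the mean of the observed ones. Mean imputation is only one simple option chosen here for brevity,
not a method prescribed by the cited work; the function name `impute_with_mean` and the sample values
are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence


def impute_with_mean(readings: Sequence[Optional[float]]) -> List[float]:
    """Replace missing readings (None) with the mean of the observed ones."""
    observed = [value for value in readings if value is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill_value = mean(observed)
    return [fill_value if value is None else value for value in readings]


# Hypothetical temperature samples; None marks a reading lost to a failed sensor node.
samples = [21.0, None, 23.0, 22.0, None]
completed = impute_with_mean(samples)
print(completed)
```

Mean imputation keeps the sample average unchanged but understates the variance, which is one reason
more elaborate methods exist.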


2.2. Storage and Analysis of Data 


Data size has increased tremendously in recent years due to numerous sources, such as aerial sensory
technologies, mobile devices, radio frequency identification readers, and remote sensing. These data are
kept at great expense, only to be disregarded or erased in the end due to a lack of storage capacity.
Phase Change Memory and Solid-State Drives were created to address the drawback of the hard disk’s poor
input-output performance. Yet, these technologies alone are incapable of handling such massive data
workloads. Hadoop and MapReduce, by contrast, aid in gathering a large volume of unstructured data in a
relatively short period (Acharjya, 2016).


2.3. Scalability and Visualization of Data 


As data volumes rise faster than CPU performance, processing techniques are undergoing a significant
shift, with a growing number of cores being added. The goal of data visualization is to convey data more
effectively using graph theory approaches. Every month, online marketplaces such as Amazon and eBay have
millions of customers and billions of products for sale. This produces a large amount of data. To that
end, several businesses employ the Tableau application for massive data visualization. It can convert
enormous amounts of complicated data into simple images. However, today’s large data visualization
solutions typically perform poorly in terms of functionality, scalability, and response time. To address
this issue, more mathematical models should be linked to computer science (Acharjya, 2016).


2.4. Computational Complexities 


Representation and knowledge discovery are critical issues that span several sub-fields. To handle
problems and/or requests, many combination strategies are employed. As the volume of big data grows,
these strategies become inefficient for obtaining significant information. Massive datasets may be managed
through data marts and data warehouses. Large datasets entail greater computational complexity, although
specialized data related to a certain topic can be used to comprehend this complexity. One of the primary
goals of the research is to reduce complexity and processing costs (Shikha and Jimmy, 2018).


2.5. Privacy and Security 


Security is one of the big data difficulties for several reasons. First, the big data framework comprises
several distinct data formats, each with its own security requirements, so security cannot be guaranteed.
Second, parallel data processing presents a new difficulty in which we must ensure data security. The
third difficulty stems from real-time analytics: how does the framework protect privacy while doing
real-time analysis? Additionally, huge data are kept as distributed files in the cloud, making security
increasingly complex. Data backup in big data systems has revealed security issues, and the production of
several replicas puts data in danger. Regulations specify the data that is stored, processed, and
analyzed, but there is no guarantee that this data will be saved (Abugabita et al., 2019).


3. Big Data Processing Frameworks 


Big data processing is the process of dealing with massive amounts of data and converting it from its raw
form into a useful and more understandable form (Benjelloun et al., 2020). This section highlights the
most powerful frameworks used to manage massive amounts of quickly generated data. Based on their data
processing approaches, these frameworks are often grouped into five classes, as shown in Figure 2: batch
processing, streaming processing, real-time processing, interactive processing, and hybrid processing.


Figure 2: Big Data Processing Frameworks Classification


3.1. Batch Processing


Batch processing is utilized when data is collected or kept in big files (Saadoon et al., 2022). It is the
processing of large data blocks that have been previously recorded in a database (Abugabita et al., 2019).
A batch processing framework needs data to be collected over time and all data required for the batch to
be loaded into some kind of storage, such as a file system or database, before it is processed. Batch
processing is frequently employed when working with huge amounts of data (Cumbane and Gidófalvi, 2019).
The most famous frameworks for this type of processing are Apache Hadoop and Dryad.


3.1.1. Hadoop Framework


Today, Apache Hadoop is the most popular batch data processing framework. Apache Hadoop is a free and
open-source Java framework for processing and querying large volumes of data on commodity hardware
clusters. Yahoo! has made enormous technological investments in it, and Apache Hadoop, launched in 2006,
has evolved into an enterprise-ready cloud computing platform. Its influence may be summed up in four key
features: Hadoop delivers scalable, cost-effective, adaptable, and fault-tolerant systems. Hadoop is made
up of two major components: the Hadoop Distributed File System (HDFS) and the MapReduce programming
framework. As a result, the storage system is not physically isolated from the processing system
(Al-Barznji and Atanassov, 2016; and Otoo-Arthur and Zyl, 2020). Google’s MapReduce and the Google File
System, which inspired Hadoop, were introduced by 2004 in response to the ever-increasing volume of data
on the web. MapReduce is the first and native batch processing programming method (engine) of Hadoop. It
is intended for parallel processing of huge amounts of data by separating the work into many independent
tasks. In addition, Yet Another Resource Negotiator (YARN) was released in 2012 by Yahoo! and Hortonworks
(Khalid and Yousaf, 2021; Benjelloun et al., 2020; and Cumbane and Gidófalvi, 2019). The Hadoop ecosystem
also includes the Hadoop kernel, as well as other components such as HBase, Apache Hive, Oozie, Zookeeper,
and Pig (Al-Barznji and Atanassov, 2016).
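
The MapReduce programming model can be illustrated with the classic word count. The sketch below runs in
a single Python process and only mimics the three logical steps (map, shuffle, reduce); it is not Hadoop
code, and the names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative. In a real
cluster, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between the two phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    emitted = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())


counts = word_count(["big data", "big frameworks"])
print(counts)
```

Because each map call looks at one line and each reduce call looks at one key, the work can be split
into independent tasks, which is what makes the model suitable for parallel execution.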


3.1.2. Dryad Framework 


Dryad is a parallel and distributed processing framework that was initiated by Microsoft in 2004. It is
a powerful module that can improve processing capacity and grow from a small cluster to a bigger one. This
framework enables users to access a cluster’s resources for parallel data processing. Dryad is a highly
sophisticated framework that covers the entire set of tasks, such as job creation, monitoring and
management, visualization, resource management, and fault tolerance (Abugabita et al., 2019).


3.2. Streaming Processing 


Stream processing is utilized when data has to be analyzed as quickly as it arrives, whether it is social
data or machine data from IoT systems. The goal of using stream-based processing is to achieve low latency
and a real-time reaction to each new event (Benjelloun et al., 2020). In short, streaming data processing
means that the data will be examined and actions will be taken on it as soon as possible, usually in near
real-time. IBM Infosphere Streams and Apache Samza are the most well-known frameworks for this type of
processing.
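
The idea of acting on each event as it arrives, rather than waiting for a complete batch, can be
sketched with a Python generator. The example computes a rolling average over the most recent events;
the function name `rolling_average` and the window size are illustrative choices and are not tied to any
of the frameworks discussed here.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def rolling_average(events: Iterable[float], window: int) -> Iterator[float]:
    """Yield the average of the last `window` events each time a new one arrives."""
    recent: Deque[float] = deque(maxlen=window)  # older events drop out automatically
    for value in events:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical sensor readings; an iterator stands in for a live stream.
readings = iter([10.0, 20.0, 30.0, 40.0])
averages = list(rolling_average(readings, window=2))
print(averages)
```

A result is produced after every event, so the consumer never has to wait for the input to end.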


3.2.1. IBM Infosphere Streams


It is an IBM framework designed to handle unbounded streams of data at high speeds. It can handle both
unstructured and structured data streams and can be expanded to a high number of nodes. IBM Streams is
capable of processing complicated data streams at a fast pace and with extremely low latency. It comes
with a stream processing language that allows users to construct stream applications using a high-level
programming language (Abugabita et al., 2019).


3.2.2. Apache Samza


In 2013, Samza was created by LinkedIn and contributed to the Apache Software Foundation later that year.
Samza is designed to accommodate large data stream throughput (millions of messages per second) while also
delivering rapid fault recovery and great dependability. Samza is now used by many large corporations,
including LinkedIn, Uber, VMware, TripAdvisor, and Netflix (Manjula and Prema, 2020). Samza’s vision is to
create a lightweight platform for continuous data processing. For task execution, platforms such as
Apache YARN and Apache Mesos can be used. Samza has built-in integration with Apache YARN and Apache Kafka
(Apache Samza, 2022).


3.3. Real-Time Processing


With the advancement of technology and techniques, real-time data processing ensures that real-time data
is acted on in time. That is, processing time is measured in milliseconds, and output is provided as soon
as the input arrives (Ariyaluran et al., 2019). Real-time systems are incredibly difficult to construct
with standard software. Amazon Kinesis and Apache Storm are the most well-known frameworks for this kind
of processing.


3.3.1. Amazon Kinesis 


Amazon Kinesis is a framework for distributed message queuing. It can handle massive data sets and vast
pipelines, and the output created by Kinesis may be used by machine learning techniques (Ariyaluran et
al., 2019). Amazon Kinesis allows you to receive, store, and analyze real-time streaming data, allowing
you to get insights in seconds or minutes rather than hours or days. With very low latency, Amazon Kinesis
can handle any quantity of streaming data and analyze data from hundreds of thousands of sources (Amazon
Kinesis, 2022).


3.3.2. Apache Storm


Storm was created by Nathan Marz of BackType, which Twitter purchased in 2011. In 2012, Storm was made
open-source, and it was incorporated into the Apache projects in 2014 (Khalid and Yousaf, 2021). It is
intended to be scalable, robust, extendable, efficient, and simple to manage. Beyond its structure, its
primary objective is to eliminate message loss due to node failures and to ensure at-least-once processing
(Cumbane and Gidófalvi, 2019). Apache Storm is a real-time distributed big data processing framework
developed to process massive volumes of data in a fault-tolerant and horizontally scalable way. It uses
Apache Zookeeper to handle the cluster state and the distributed environment. It accepts a raw stream of
real-time data at one end, processes it through a succession of small processing units, and produces
meaningful data at the other end (Basha et al., 2019).
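
The at-least-once guarantee mentioned above can be illustrated with a small replay loop: a message that
fails during processing is put back in the queue and tried again instead of being dropped. This is a
single-process sketch of the general idea only, not Storm's actual acknowledgement mechanism; the names
`run_at_least_once` and `flaky_step` and the retry limit are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set, Tuple


def run_at_least_once(
    messages: Iterable[str],
    process: Callable[[str], None],
    max_attempts: int = 3,
) -> List[str]:
    """Replay each message until `process` succeeds or the attempts run out.

    Returns the messages that could not be processed within `max_attempts`.
    """
    pending: Deque[Tuple[str, int]] = deque((message, 1) for message in messages)
    unprocessed: List[str] = []
    while pending:
        message, attempt = pending.popleft()
        try:
            process(message)  # returning normally plays the role of an acknowledgement
        except RuntimeError:
            if attempt < max_attempts:
                pending.append((message, attempt + 1))  # schedule a replay
            else:
                unprocessed.append(message)
    return unprocessed


seen: List[str] = []
failed_once: Set[str] = set()


def flaky_step(message: str) -> None:
    """Hypothetical processing unit that fails the first time it receives 'b'."""
    if message == "b" and message not in failed_once:
        failed_once.add(message)
        raise RuntimeError("simulated node failure")
    seen.append(message)


lost = run_at_least_once(["a", "b", "c"], flaky_step)
print(seen, lost)
```

Here the failed message is replayed after the others, so no message is lost but the original order is
not preserved. In a distributed setting a message may also be replayed after it was in fact processed,
which is why the guarantee is "at least once" rather than "exactly once".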


3.4. Interactive Processing 


Interactive processing is a data analysis approach that allows interactive querying of big data,
including varied data of terabyte size, while satisfying response-time requirements. The user is directly
linked to the computer and may interact with it; data can be compared, altered, and assessed in a visual
or tabulated format, or both at the same time. The essential issue in interactive processing is dealing
with small jobs, which are inefficient to handle when split into Map/Reduce tasks (Abugabita et al.,
2019). The most common frameworks in this category are Apache Impala and Apache Drill.


3.4.1. Apache Impala


Impala, an open-source SQL engine that operates on hundreds of computers as a distributed architecture,
was assessed in 2015. Impala has a Massively Parallel Processing (MPP) engine that outperforms Hive and
Spark SQL. Impala provides adequate performance by using aggregations, scans, and joins to serve queries.
It is fault tolerant and has low latency. Impala data is saved in Parquet files. This approach does not
rely on Hadoop MapReduce but instead installs a collection of modules on each DataNode for local
processing, a design intended to prevent bottlenecks. Impala has a relatively low run time and is queried
through HiveQL (Abugabita et al., 2019).
3.4.2. Apache Drill


Apache Drill is a distributed system for interactive big data analysis. It is a low-latency distributed
query engine for large-scale datasets, including structured and semi-structured data. Drill, inspired by
Google’s Dremel, is intended to scale to thousands of nodes and query petabytes of data at the interactive
rates required for BI/Analytics settings. The ‘Drillbit’ service is at the heart of Apache Drill and is
responsible for taking client requests, executing queries, and providing results to the client. Drill
relies on Zookeeper to keep track of cluster membership and health-check data. Drill is compatible with a
wide range of file systems and NoSQL databases, including MongoDB, HBase, MapR-DB, MapR-FS, HDFS, Google
Cloud Storage, Amazon S3, Azure Blob Storage, NAS, Swift, and local files. Data from several data stores
can be joined in a single query (Apache Drill, 2022).


3.5. Hybrid Processing 


The frameworks in hybrid processing can be used for more than one form of data processing, which means
that they support both batch data processing and stream data processing. Apache Spark and Apache Flink are
the most well-known frameworks for this type of processing.


3.5.1. Apache Flink 


Apache Flink is a modern framework for distributed processing and intensive streaming analytics. It is a
next-generation, large-scale data processing framework that is meant to operate with low latency and high
throughput in all common cluster setups (Toliopoulos et al., 2020). Flink was founded in 2009 as
Stratosphere at the Technical University of Berlin. Stratosphere became an open-source Apache incubator
project called “Flink” in 2014. Flink is reported to process data up to 100 times faster than MapReduce.
Flink is primarily a stream processing engine that does not provide its own storage or resource management
system. Flink provides two fundamental APIs: the DataSet API for processing bounded data streams (batch
processing) and the DataStream API for potentially unbounded data streams (Apache Flink, 2022).
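
The distinction between bounded and unbounded inputs can be sketched in plain Python: the same
record-by-record operator is applied once to a finite list (batch) and once to an endless generator
(stream). This only illustrates the concept behind hybrid processing and does not use the Flink APIs;
the operator name `square_evens` is hypothetical.

```python
from itertools import count, islice
from typing import Iterable, Iterator


def square_evens(records: Iterable[int]) -> Iterator[int]:
    """One operator, written once, applied record by record to any input."""
    for record in records:
        if record % 2 == 0:
            yield record * record


# Bounded input: the data has a known end, so the full result can be collected.
bounded_input = [1, 2, 3, 4]
batch_result = list(square_evens(bounded_input))

# Unbounded input: the data never ends, so only a prefix of the result is taken.
unbounded_input = count(start=1)
first_three = list(islice(square_evens(unbounded_input), 3))

print(batch_result, first_three)
```

Treating a batch as a stream that happens to end is what lets a single engine serve both kinds of
workload.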


3.5.2. Apache Spark 


Apache Spark is an open-source big data processing platform designed for speed and complex analytics. It
is simple to use and was created in 2009 at UC Berkeley’s AMPLab. It was open-sourced in 2010 and later
became an Apache project. Spark allows you to easily create apps in Scala, Java, or Python (Apache Spark,
2022). Spark is an extremely powerful tool that can handle large datasets that are structured,
semi-structured, or unstructured in a variety of ways. It can handle data in batches or streams. Spark
includes MLlib and Spark ML (Pipelines API). Spark MLlib is Spark’s implementation of machine learning
methods based on Resilient Distributed Datasets (RDDs). The newer Spark ML is built on top of the Spark
Dataset API. It has tremendous scalability and exceptional usability. The pipeline is a sophisticated
capability included with Spark ML. The processing speed of Hadoop MapReduce is slow since it needs disk
access for reads and writes. Spark, on the other hand, stores data in memory, decreasing the number of
read and write cycles (Ahmed et al., 2020). In memory, Spark can run applications up to hundreds of times
faster than Hadoop MapReduce, while on disk, it can run applications ten times faster (Al-Barznji and
Atanassov, 2018). For tasks on huge datasets, Spark is preferable to MapReduce. Due to its diverse
capabilities, Spark has been embraced as a processing engine for handling big data challenges by numerous
research disciplines, such as pattern mining and machine learning (Hicham and Anis, 2021).
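
Why keeping data in memory reduces read cycles can be shown with a toy model that counts how often the
input is fetched from storage during an iterative job. The sketch is a plain-Python stand-in, not Spark
or MapReduce code; the class `CountingSource` and the two functions are hypothetical, and real speedups
depend on the workload and cluster.

```python
from typing import List


class CountingSource:
    """Stand-in for a dataset on disk that records how often it is read."""

    def __init__(self, records: List[int]) -> None:
        self._records = records
        self.disk_reads = 0

    def read(self) -> List[int]:
        self.disk_reads += 1
        return list(self._records)


def iterate_from_disk(source: CountingSource, passes: int) -> int:
    """Re-read the input on every pass, as a chain of disk-based jobs would."""
    total = 0
    for _ in range(passes):
        total += sum(source.read())
    return total


def iterate_in_memory(source: CountingSource, passes: int) -> int:
    """Read the input once, keep it in memory, and reuse it on every pass."""
    cached = source.read()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


disk_source = CountingSource([1, 2, 3])
memory_source = CountingSource([1, 2, 3])
disk_total = iterate_from_disk(disk_source, passes=5)
memory_total = iterate_in_memory(memory_source, passes=5)
print(disk_source.disk_reads, memory_source.disk_reads)
```

Both variants compute the same result, but the in-memory variant touches storage once instead of once
per pass, which matters most for iterative workloads such as machine learning.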


4. Discussion 


This section presents a collection of common features discovered throughout this research and compares the
frameworks across these features, as summarized in Table 1. As shown there, all the frameworks discussed
are open-source except for the Dryad framework, which is closed-source, and all of them are frameworks for
massively distributed and/or parallel processing. Moreover, the frameworks are grouped according to their
big data processing type. Most of the frameworks compute in memory and have low latency, whereas Hadoop
and Dryad are disk-based and have high latency. In terms of processing speed, Flink is the fastest and
Hadoop is the slowest. All the frameworks mentioned are highly fault-tolerant, except Apache Drill, which
has low fault tolerance. This comparison can serve as a useful guide for selecting the most suitable
framework, based on its characteristics, for handling and processing different big datasets or
applications efficiently.
Table 1: Comparison of the Features of Big Data Processing Frameworks

| Features | Hadoop | Dryad | Spark | Flink | Storm | Samza | Drill | Impala |
|---|---|---|---|---|---|---|---|---|
| Major Backers | Google, Yahoo! | Microsoft | Berkeley's AMPLab | Apache Software Foundation | BackType, Twitter | LinkedIn | Google's Dremel | Marcel Kornacker |
| Open Source | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Big Data Processing | Batch | Batch | Hybrid (Batch and Stream) | Hybrid (Batch and Stream) | Real-time | Streaming | Interactive | Interactive |
| Resource Manager | YARN | YARN | Stand-alone, YARN, Mesos | Stand-alone, YARN, Mesos | YARN, Mesos | Stand-alone, YARN, Mesos | Zookeeper | YARN |
| Storage | HDFS | Distributed File System (DFS) | HDFS, HBase, Hive, Cassandra | HDFS, streams, databases | HDFS | HDFS | HDFS, HBase, and Hive | HDFS or HBase |
| Data Sources | HDFS | Computer cluster or a data center | DBMS, HDFS, and Kafka | DBMS, HDFS, and Kafka | Spout | Kafka | HDFS, Hive, RDBMS, HBase, and MongoDB | HDFS, HBase |
| Computing Mode | Disk-based | Disk-based | In memory | In memory | In memory | In memory | In memory | In memory |
| Processing Speed | Slow | Fast | Fast | Very Fast | Fast | Fast | Fast | Fast |
| Execution Model | MapReduce | Directed Acyclic Graph (DAG) | Resilient Distributed Dataset (RDD), DAG | Data flow graph | Topology | DAG | Query execution "pipeline model" | HiveQL, Massively Parallel Processing (MPP) |
| Scalability | High | High | Moderate | High | Moderate | Low | High | High |
| Fault Tolerance | Yes | Yes | Yes | Yes | Yes | Yes | Yes (Low) | Yes |
| Latency | High | High | Low | Very Low | Very Low | Very Low | Low | Low |
| Throughput/Performance | High | High | High | High | Medium | High | High | High |
| Implementation Languages | Java | C++ | Scala | Java, Scala | Clojure | Scala, Java | Java | SQL |
| Supported Programming Languages | C, C++, Perl, Ruby, PHP, Python, etc. | .Net (C#, VB, etc.) | Java, Scala, R, and Python | Java, Scala, R, and Python | Any Programming Language | JVM Languages (Java, Scala) | SQL and Alternative Query Languages | All languages supporting JDBC/ODBC |
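
One way to use the comparison in Table 1 as a selection guide is to encode a few of its rows and filter
on the requirements of an application. The sketch below copies only the "Big Data Processing" and
"Latency" rows from the table; the names `Framework`, `TABLE_1`, `LATENCY_RANK`, and `shortlist` are
hypothetical, and a real decision would weigh more features than these two.

```python
from typing import Dict, List, NamedTuple


class Framework(NamedTuple):
    processing: str  # value from the "Big Data Processing" row of Table 1
    latency: str     # value from the "Latency" row of Table 1


TABLE_1: Dict[str, Framework] = {
    "Hadoop": Framework("batch", "high"),
    "Dryad": Framework("batch", "high"),
    "Spark": Framework("hybrid", "low"),
    "Flink": Framework("hybrid", "very low"),
    "Storm": Framework("real-time", "very low"),
    "Samza": Framework("streaming", "very low"),
    "Drill": Framework("interactive", "low"),
    "Impala": Framework("interactive", "low"),
}

# Assumed ordering of the latency labels used in the table, best first.
LATENCY_RANK = {"very low": 0, "low": 1, "high": 2}


def shortlist(processing: str, max_latency: str) -> List[str]:
    """Return frameworks of the given class whose latency is at most `max_latency`."""
    limit = LATENCY_RANK[max_latency]
    return sorted(
        name
        for name, row in TABLE_1.items()
        if row.processing == processing and LATENCY_RANK[row.latency] <= limit
    )


print(shortlist("hybrid", "very low"))
print(shortlist("interactive", "low"))
```

Further rows of the table, such as scalability or supported languages, could be added as extra fields
and filters in the same way.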
5. Conclusion and Future Work


Given the existence of multiple different big data processing frameworks, choosing the best-suited
framework based on the application and environment is difficult. Little explicit guidance is offered to
assist developers and users in selecting an appropriate framework for their project. To address this
research gap, this study presents a comprehensive compilation of the most prominent big data processing
frameworks, highlighting the benefits and shortcomings of each framework. Furthermore, this study explored
what big data means and highlighted the most influential sources responsible for creating data volume, as
well as the big data features and certain big data issues. Then, the most well-known methodologies of big
data processing frameworks were explored in detail and classified into five classes: batch processing,
streaming processing, real-time processing, interactive processing, and hybrid processing. As a result,
this study may be used as a guide for determining the optimal framework for an application or IT analytics
task, assisting researchers and readers, as well as business users, in making faster and more informed
decisions, and promoting enhancement, innovative work, and the implementation of such beneficial
frameworks in the near future. Future work will involve experiments on large data sets with each framework
and a comparison of the results to determine their efficiency in handling large volumes of data.